Explore Red Wine Quality by Dewi Octavia

In this project, a data set of red wine quality is explored based on its physicochemical properties using the statistical software R. The objective is to find the physicochemical properties that distinguish good quality wines from lower quality ones. An attempt to build a linear model of wine quality is also shown.

Summary Statistics

First, I would like to understand the structure of the data set. The summary and str functions were used for this purpose. The data set consists of 1599 observations with 11 physicochemical properties as input variables and quality as the output. Wine quality is an ordered, discrete variable ranging from 3 to 8, with a mean of 5.6 and a median of 6. Each observation is identified by the X variable. According to the data set description, one pair of variables is dependent: free sulfur dioxide is a subset of total sulfur dioxide.
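As a minimal sketch of this first look, the snippet below runs str and summary on a tiny made-up data frame. The name rw_toy and all of its values are hypothetical stand-ins; the real data frame has 1599 rows and more columns.

```r
# Hypothetical stand-in for the wine data; values are made up for illustration.
rw_toy <- data.frame(
  X       = 1:6,
  alcohol = c(9.4, 9.8, 9.8, 11.2, 10.5, 12.1),
  quality = c(5L, 5L, 6L, 7L, 4L, 8L)
)

str(rw_toy)                                  # column types and first values
quality_summary <- summary(rw_toy$quality)   # min, quartiles, median, mean, max
print(quality_summary)
n_obs <- nrow(rw_toy)                        # number of observations
```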

Univariate Plots Section

After this first look at the data set, I will plot each variable as a histogram to get a quick view of its distribution.

## [1] "=== Stats Summary of pH ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
## [1] "=== Stats Summary of Density ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

pH and density appear to be normally distributed. This is supported by their nearly equal mean and median values, which indicate a symmetric distribution. The other variables are mostly long-tailed with a few outliers. I will replot the long-tailed distributions on a log scale and compare each with its original plot and summary statistics.
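The mean-versus-median check can be illustrated on simulated data. This is only a sketch: the two variables below are generated, not taken from the wine data, and the helper mean_median_gap is a hypothetical name introduced here. A gap near zero suggests symmetry (it does not prove normality), and a log transform should shrink the gap for a right-skewed variable.

```r
set.seed(42)
# Simulated data (hypothetical): one symmetric and one right-skewed variable.
symmetric <- rnorm(1000, mean = 3.3, sd = 0.15)
skewed    <- rlnorm(1000, meanlog = 0.8, sdlog = 0.5)

# Gap between mean and median, expressed in standard deviations.
mean_median_gap <- function(x) (mean(x) - median(x)) / sd(x)

gap_symmetric <- mean_median_gap(symmetric)
gap_skewed    <- mean_median_gap(skewed)
gap_logged    <- mean_median_gap(log(skewed))   # same variable on a log scale

print(round(c(symmetric = gap_symmetric,
              skewed    = gap_skewed,
              logged    = gap_logged), 3))
```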

## [1] "=== Stats Summary ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The distribution of fixed acidity in log scale seems to be more normal.

## [1] "=== Stats Summary ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The distribution of volatile acidity on a log scale is also more normal, although it still looks slightly skewed.

## [1] "=== Stats Summary ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## [1] "Number of zero values"
## 
## FALSE  TRUE 
##  1467   132

In the original plot, there is a suspiciously high count of zeros in citric acid. I wonder whether these are true zeros or simply ‘not available’ values. A quick check using the table function shows 132 observations with a value of zero and no NA values in the reported citric acid concentration. The citric acid concentration may have been too low to measure and was therefore reported as zero. Replotting the citric acid distribution on a log scale does not help normalize it, which could be due to the high count of zeros mentioned above.
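A check of this kind might look like the sketch below. The vector citric_toy is a hypothetical handful of readings, not the real column. It also shows why zeros are a problem on a log scale: their logarithm is -Inf, so they cannot be placed in a histogram bin.

```r
# Hypothetical citric acid readings; the real column has 1599 values.
citric_toy <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.06, 0.36, 0.02)

zero_counts <- table(citric_toy == 0)   # FALSE / TRUE counts
na_count    <- sum(is.na(citric_toy))   # number of missing values
log_values  <- log10(citric_toy)        # zeros become -Inf
n_dropped   <- sum(!is.finite(log_values))

print(zero_counts)
```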

## [1] "=== Stats Summary ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The residual sugar distribution is more normal on a log scale, but it is still long-tailed.

## [1] "=== Stats Summary ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Distribution of log(chlorides) is more normal than distribution of chlorides.

## [1] "=== Stats Summary ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Replotting the free sulfur dioxide distribution on a log scale reveals a bimodal distribution.

## [1] "=== Stats Summary ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Distribution of total sulfur dioxide in log scale is more normal.

## [1] "=== Stats Summary ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Distribution of log(sulphates) is normal.

## [1] "=== Stats Summary ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Replotting alcohol in log scale does not normalize the distribution.

## [1] "=== Stats Summary ==="
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Distribution of quality in log scale is more skewed than its original scale. In the original plot, we can see that the count of mid range quality (quality of 5 and 6) is considerably higher than others. This might become an issue when comparing low and high quality wines.

Wine quality ranges from 3 to 8 in this data frame. Since I am more interested in investigating what makes a higher quality wine, I will add a new variable, quality.rating, that categorises quality values of 3-4 as ‘bad’, 5-6 as ‘average’, and 7-8 as ‘good’.

rw$quality.rating <- ifelse(rw$quality <5.0, 'bad', 'average')
rw$quality.rating <- ifelse(rw$quality >6.0, 'good',rw$quality.rating)
rw$quality.rating <- ordered(rw$quality.rating, levels = c('bad','average','good'))

rw[1:20, 13:14]
##    quality quality.rating
## 1        5        average
## 2        5        average
## 3        5        average
## 4        6        average
## 5        5        average
## 6        5        average
## 7        5        average
## 8        7           good
## 9        7           good
## 10       5        average
## 11       5        average
## 12       5        average
## 13       5        average
## 14       5        average
## 15       5        average
## 16       5        average
## 17       7           good
## 18       5        average
## 19       4            bad
## 20       6        average
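The same three-way grouping can also be written with cut(), which bins a numeric vector in one call. This is an alternative sketch, not the code used in this report; quality_toy and rating_toy are hypothetical names.

```r
quality_toy <- c(3, 4, 5, 6, 7, 8)   # hypothetical scores covering the full range

# Right-closed bins: (2,4] -> bad, (4,6] -> average, (6,8] -> good
rating_toy <- cut(quality_toy,
                  breaks = c(2, 4, 6, 8),
                  labels = c("bad", "average", "good"),
                  ordered_result = TRUE)

print(table(rating_toy))
```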

The average rating dominates the distribution of quality ratings, as already seen in the quality distribution. This is likely to cause overplotting, so I will compare only the bad and good wines to find distinctive properties that separate the two.
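Restricting the comparison to the two extreme ratings could be sketched as follows. The data frame toy is hypothetical; only the column name quality.rating and its levels follow the report.

```r
# Hypothetical mini data set with the same rating levels as in the report.
toy <- data.frame(
  alcohol = c(9.0, 9.4, 9.8, 10.0, 10.5, 11.8, 12.5),
  quality.rating = ordered(
    c("bad", "average", "average", "average", "average", "good", "good"),
    levels = c("bad", "average", "good"))
)

# Share of each rating, to quantify the imbalance.
rating_share <- prop.table(table(toy$quality.rating))

# Keep only bad and good wines and drop the now-empty 'average' level.
extremes <- droplevels(subset(toy, quality.rating != "average"))
```

Dropping the unused level keeps it out of plot legends and facet panels.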

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

The feature of interest is wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Based on the data description, I suspect that acids, residual sugar and sulfur dioxide affect the taste and hence the wine quality.

Did you create any new variables from existing variables in the dataset?

The quality.rating variable was created to group wine quality into three ratings: bad, average and good.

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

pH and density appear to be normally distributed. The other variables are mostly long-tailed with a few outliers. At this stage, I have not changed the form of the data.

Bivariate Plots Section

In this section, I will compare pairs of variables to see whether they are correlated. I used the ggpairs function to spot obvious patterns between variables.

To better visualise the strong correlations between variables, I will use the corrplot function.

library(corrplot)

# Exclude variable X from the correlation calculation
rw.noX <- subset(rw, select = -X)

rw.corr <- cor(rw.noX[sapply(rw.noX, is.numeric)])

corrplot(rw.corr, method = "number")

The ggpairs and corrplot functions above highlight the correlation between pairs of variables. I will only look into variable pairs with a correlation coefficient above 0.30 or below -0.30. Let’s put those data wrangling skills to use and show those pairs in a new data frame!

#Re-arrange the correlation table
rw.corr.melt <- melt(rw.corr)

rw.corr.melt <- subset(rw.corr.melt, 
                       value <=-0.30 | value >=0.30) 

rw.corr.melt <- subset(rw.corr.melt, 
                       value != 1.0) 

rw.corr.melt <- arrange(rw.corr.melt, 
                        desc(value))


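# Each pair appears twice, as (a, b) and (b, a), in adjacent rows after sorting;
# keep only the odd rows to delete the repeated correlations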
rw.corr.melt <- rw.corr.melt[-seq(2, 
                                  nrow(rw.corr.melt), 
                                  by =2),]

rw.corr.melt$value <- round(rw.corr.melt$value, 3)

rw.corr.melt
##                      X1                  X2  value
## 1           citric.acid       fixed.acidity  0.672
## 3               density       fixed.acidity  0.668
## 5  total.sulfur.dioxide free.sulfur.dioxide  0.668
## 7               quality             alcohol  0.476
## 9             sulphates           chlorides  0.371
## 11              density         citric.acid  0.365
## 13              density      residual.sugar  0.355
## 15            sulphates         citric.acid  0.313
## 17                   pH             density -0.342
## 19              quality    volatile.acidity -0.391
## 21              alcohol             density -0.496
## 23                   pH         citric.acid -0.542
## 25          citric.acid    volatile.acidity -0.552
## 27                   pH       fixed.acidity -0.683
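Deleting every other row relies on the two copies of each pair being adjacent after sorting. An alternative sketch that avoids this assumption reads each pair once from the upper triangle of the correlation matrix. It uses base R only, and the data frame toy is simulated, not the wine data.

```r
set.seed(1)
# Simulated stand-in data (hypothetical); in the report this would be rw.noX.
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.5)    # strongly related to a
toy$c <- -toy$a + rnorm(50, sd = 0.5)   # negatively related to a
toy$d <- rnorm(50)                      # unrelated noise

corr <- cor(toy)

# Each unordered pair appears exactly once above the diagonal.
idx <- which(upper.tri(corr), arr.ind = TRUE)
pairs <- data.frame(
  var1  = rownames(corr)[idx[, "row"]],
  var2  = colnames(corr)[idx[, "col"]],
  value = round(corr[idx], 3)
)

# Keep |r| >= 0.30 and sort from most positive to most negative.
strong <- pairs[abs(pairs$value) >= 0.30, ]
strong <- strong[order(strong$value, decreasing = TRUE), ]
print(strong)
```

Because the diagonal is excluded by upper.tri, no separate step is needed to remove the self-correlations of 1.0.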

Now that we have the list of variable pairs, let’s plot them!

Fixed acidity has a strong positive correlation with citric acid. In the description of data attributes, fixed acidity is defined as ‘most acids involved with wine or fixed or nonvolatile (do not evaporate easily)’. I wonder whether this means citric acid is part of fixed acidity. If it is, other variables that correlate well with fixed acidity should also show some correlation with citric acid. A quick peek into the rw.corr.melt data frame seems to support this. I will discuss this further as I plot the rest of the graphs.

The plot above shows a strong positive correlation between fixed acidity and density. If our previous suspicion is true, we will also see some correlation between density and citric acid.

Citric acid is indeed correlated with density. Even though the correlation is not as strong as that between fixed acidity and density, the regression line suggests a linear relationship between the two variables. Now let’s find another variable that is correlated with fixed acidity and compare it with citric acid.

pH is very well correlated with both fixed acidity and citric acid. On second thought, the strong correlation of pH and density with both acids could simply reflect common physical properties of acids. The strong correlation between citric acid and fixed (tartaric) acid could result from both being predominant fixed acids found in wine grapes (Nierman, 2004).

From the plot, we can see a clear relationship between free sulfur dioxide and total sulfur dioxide. This is consistent with free sulfur dioxide being a subset of total sulfur dioxide.

This plot is rather interesting. There is overcrowding at quality 5 and 6 due to the higher number of mid-range quality wines in the data set. However, comparing the low quality wines (3-4) with the high quality wines (7-8) shows a trend of increasing alcohol content from low to high quality.

Sulphates and chlorides seem to be correlated, but only weakly.

As a sugar solution is denser than water, density is expected to increase with residual sugar concentration. The plot above shows a rather weak correlation between these two variables.

There is a weak positive correlation between citric acid and sulphates.

This plot shows red wine with lower pH tends to have higher density.

The plot above shows a negative correlation between volatile acidity and wine quality. High quality red wines have lower volatile (acetic) acidity.

Density is expected to decrease as alcohol content increases, as the plot shows. During fermentation, sugar is turned into alcohol. The more alcohol produced, the less sugar remains, hence the lower density.

Volatile acidity shows a strong negative correlation with citric acid. This can be explained by the conversion of citric acid to acetic acid in winemaking practices involving malo-lactic bacteria (Shimazu et al., 1985).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Variables that show correlation to quality are alcohol and volatile acidity. Alcohol content has positive correlation to wine quality. On the other hand, volatile acidity is negatively correlated to quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • The three acid variables appear to be correlated. As citric acid increases, fixed acidity increases and volatile acidity decreases. Citric acid is negatively correlated with volatile (acetic) acid because citric acid is subsequently converted to acetic acid in the winemaking process. Tartaric and citric acids are both predominant fixed acids in wine grapes, which could be why fixed acidity and citric acid are positively correlated.
  • Among the three acids, volatile acidity is less strongly correlated with pH and density than fixed acidity and citric acid are.
  • Residual sugar has the expected correlation with density. Since a sugar solution is denser than water, density increases as residual sugar increases.
  • Chlorides seem to be weakly correlated with sulphates.
  • Free sulfur dioxide is strongly correlated with total sulfur dioxide, as they are a dependent (subset) pair of variables.
  • Density is strongly correlated with fixed acidity. A quick check in Wikipedia reveals that tartaric acid has a higher density than water.
  • Alcohol content is negatively correlated with density. During fermentation, sugar is converted to alcohol. The more alcohol produced, the less sugar remains, hence the lower density.

What was the strongest relationship you found?

pH to fixed acidity has the strongest relationship, which makes sense as pH is the scale to measure acidity. This is followed by fixed acidity to citric acid, fixed acidity to density and free sulfur dioxide to total sulfur dioxide.

I would also like to see if any particular variable in log scale has stronger correlation to quality. The comparison will be shown in a new data frame, ‘df.corr’.

##                      Correlation to quality
## fixed.acidity                          0.12
## volatile.acidity                      -0.39
## citric.acid                            0.23
## residual.sugar                         0.01
## chlorides                             -0.13
## free.sulfur.dioxide                   -0.05
## total.sulfur.dioxide                  -0.19
## density                               -0.17
## pH                                    -0.06
## sulphates                              0.25
## alcohol                                0.48
## quality                                1.00
##                      Correlation (log scale) to quality
## fixed.acidity                                      0.11
## volatile.acidity                                  -0.39
## citric.acid                                         NaN
## residual.sugar                                     0.02
## chlorides                                         -0.18
## free.sulfur.dioxide                               -0.05
## total.sulfur.dioxide                              -0.17
## density                                           -0.18
## pH                                                -0.06
## sulphates                                          0.31
## alcohol                                            0.48
## quality                                            1.00

The new data frame above shows that transforming sulphates to a log scale improves its correlation with quality. The same is observed for chlorides, although its correlation with wine quality is not as strong as that of sulphates. Sulphates and chlorides will be converted to a log scale for the rest of the analysis.

I will now compare the original and log-transformed variables against quality in graphs (to put writing a user-defined function into practice). The graphs are plotted side by side for comparison, with the log-scale graph on the right-hand side. The grid is split into three separate plots for better readability.

From the comparison between the original and log-scale plots, transforming sulphates and chlorides seems to improve the correlation slightly. This was also reflected when the distributions of sulphates and chlorides were replotted in the Univariate Plots Section: the distributions of log(sulphates) and log(chlorides) appeared more normal than their original plots.
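The NaN reported for citric acid in the log-scale column is what one would expect from the zeros in that variable: log(0) is -Inf, and a correlation involving an infinite value is not a number. The sketch below reproduces the effect on hypothetical values and shows two possible workarounds, neither of which was used in this report.

```r
# Hypothetical values; citric acid includes exact zeros, as in the wine data.
citric  <- c(0.00, 0.04, 0.56, 0.00, 0.36, 0.49, 0.10, 0.66)
quality <- c(5, 5, 6, 4, 6, 7, 5, 7)

r_raw <- cor(citric, quality)
r_log <- cor(log(citric), quality)       # log(0) is -Inf, so the result is not a number

# Workaround 1 (assumption): shift by one before taking the log.
r_log1p <- cor(log1p(citric), quality)

# Workaround 2 (assumption): restrict the comparison to non-zero readings.
positive  <- citric > 0
r_log_pos <- cor(log(citric[positive]), quality[positive])
```

Both workarounds change what is being measured, so the resulting coefficient is not directly comparable with the other rows of the table.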

Multivariate Plots Section

In this section, I will mostly focus on variables that are well correlated to quality. They are alcohol content, volatile acidity and sulphates.

This agrees with the data attributes description, which states that a high level of volatile acid gives an unpleasant, vinegar taste and hence low wine quality. Also, good wines have higher alcohol content than bad wines.

The negative correlation between alcohol and density is consistent across all three ratings. The plot also shows that, holding density constant, bad wines have lower alcohol content than good wines. The slope also changes as the wine rating improves.
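One way to put numbers on the per-rating trend lines is to fit a separate regression within each rating and compare the slopes. The sketch below does this on simulated data; the coefficients used to generate toy are made up, and regressing alcohol on density (rather than the reverse) is an assumption about the plot's orientation.

```r
set.seed(7)
# Simulated stand-in data (hypothetical): alcohol falls with density in every rating.
n <- 60
toy <- data.frame(
  rating  = rep(c("bad", "average", "good"), each = n),
  density = runif(3 * n, min = 0.992, max = 1.002)
)
offset <- unname(c(bad = 9.5, average = 10.5, good = 11.5)[as.character(toy$rating)])
toy$alcohol <- offset - 250 * (toy$density - 0.997) + rnorm(3 * n, sd = 0.3)

# Fit alcohol ~ density separately within each rating and collect the coefficients.
fits   <- lapply(split(toy, toy$rating),
                 function(d) coef(lm(alcohol ~ density, data = d)))
slopes <- sapply(fits, function(cf) cf[["density"]])
print(round(slopes, 1))
```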

This plot shows better wine tends to have higher sulphates concentration. The range of sulphate concentration for a certain wine rating seems to be narrow.

Good wines have lower volatile (acetic) acidity than bad and average wines. Overall, pH does not seem to affect the quality rating. However, for volatile acidity below 0.5 g/L, better wines tend to have lower pH when volatile acidity is held constant.

According to its correlation coefficient, citric acid is only mildly correlated with wine quality. However, at low volatile acidity (<0.6 g/L), better wines tend to have a higher citric acid concentration.

Linear Model

To build the linear prediction model, I will use the variables with the highest correlation to wine quality.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = rw)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = rw)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log(sulphates), 
##     data = rw)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log(sulphates) + 
##     citric.acid, data = rw)
## m5: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log(sulphates) + 
##     citric.acid + total.sulfur.dioxide, data = rw)
## m6: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log(sulphates) + 
##     citric.acid + total.sulfur.dioxide + density, data = rw)
## 
## ===========================================================================================
##                            m1         m2         m3         m4         m5          m6      
## -------------------------------------------------------------------------------------------
##   (Intercept)            1.875***   3.095***   3.369***   3.444***   3.658***    1.627     
##                         (0.175)    (0.184)    (0.184)    (0.196)    (0.201)    (12.006)    
##   I(alcohol)             0.361***   0.314***   0.303***   0.303***   0.290***    0.292***  
##                         (0.017)    (0.016)    (0.016)    (0.016)    (0.016)     (0.020)    
##   volatile.acidity                 -1.384***  -1.156***  -1.217***  -1.176***   -1.181***  
##                                    (0.095)    (0.097)    (0.112)    (0.112)     (0.116)    
##   log(sulphates)                               0.641***   0.659***   0.671***    0.669***  
##                                               (0.077)    (0.079)    (0.078)     (0.080)    
##   citric.acid                                            -0.113     -0.075      -0.085     
##                                                          (0.103)    (0.103)     (0.119)    
##   total.sulfur.dioxide                                              -0.002***   -0.002***  
##                                                                     (0.001)     (0.001)    
##   density                                                                        2.021     
##                                                                                (11.948)    
## -------------------------------------------------------------------------------------------
##   R-squared                 0.227      0.317      0.345      0.346      0.353      0.353   
##   adj. R-squared            0.226      0.316      0.344      0.344      0.351      0.351   
##   sigma                     0.710      0.668      0.654      0.654      0.650      0.651   
##   F                       468.267    370.379    280.646    210.808    174.115    145.012   
##   p                         0.000      0.000      0.000      0.000      0.000      0.000   
##   Log-likelihood        -1721.057  -1621.814  -1587.752  -1587.153  -1578.057  -1578.043   
##   Deviance                805.870    711.796    682.108    681.597    673.886    673.874   
##   AIC                    3448.114   3251.628   3185.503   3186.306   3170.114   3172.085   
##   BIC                    3464.245   3273.136   3212.389   3218.569   3207.754   3215.102   
##   N                      1599       1599       1599       1599       1599       1599       
## ===========================================================================================

The R-squared of the model is rather low, which could be due to the lack of variables that correlate strongly with wine quality. The model seems to fit average-rated wines better, which may be caused by the larger share of average wines in the data set.
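The step-by-step model building in the table above can be sketched with lm and update. The data below are simulated with made-up coefficients, so the numbers will not match the table; the point is only to show how nested models are fitted and how R-squared relates to the residual and total sums of squares.

```r
set.seed(123)
# Simulated stand-in data (hypothetical coefficients, not estimates from the wine data).
n <- 500
toy <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58),
  sulphates        = runif(n, 0.33, 2.00)
)
toy$quality <- 3 + 0.3 * toy$alcohol - 1.2 * toy$volatile.acidity +
  0.6 * log(toy$sulphates) + rnorm(n, sd = 0.65)

# Nested models: each one adds a predictor to the previous model.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + log(sulphates))

# R-squared by hand for m3: 1 - RSS / TSS.
rss <- sum(residuals(m3)^2)
tss <- sum((toy$quality - mean(toy$quality))^2)
r2_manual <- 1 - rss / tss

models <- list(m1 = m1, m2 = m2, m3 = m3)
r2  <- sapply(models, function(m) summary(m)$r.squared)
aic <- sapply(models, AIC)
print(round(rbind(r2, aic), 3))
```

R-squared never decreases when a predictor is added to a nested model, which is why criteria such as adjusted R-squared, AIC or BIC are the more useful guides when deciding whether an extra variable is worth keeping.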

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Low volatile acidity combined with high alcohol content and sulphates seems to make better wines.

Were there any interesting or surprising interactions between features?

Negative correlation between alcohol and density is consistent in all three ratings. The plot also shows bad rating wine has lower alcohol content compared to good rating wine.


Final Plots and Summary

Plot One: Density vs. Alcohol Content by Wine Rating

Description One

This plot shows the influence of alcohol content and density on wine rating. The negative correlation between alcohol and density is consistent across all three wine ratings and can be explained by the fermentation process in winemaking. Sugar content is directly proportional to density: higher sugar content leads to higher density. During fermentation, sugar is converted to alcohol. The more alcohol produced, the less sugar remains, hence the lower density. As expected, the slope changes as the wine rating improves. The plot also shows that, holding density constant, bad wines have lower alcohol content than good wines. The average wines are ignored when inferring relationships between variables because they are far more numerous than the other two groups.

Plot Two: Sulphates vs. Alcohol Content by Wine Rating

Description Two

This chart reveals the influence of alcohol and sulphates concentrations on red wine rating. It shows that better wines tend to have higher sulphates and alcohol concentrations. The range of sulphates concentration within a given wine rating seems to be small.

Plot Three: Linear Predicting Model

Description Three

With an R-squared of 35.3%, the linear prediction model explains only about a third of the variance in wine quality. The model appears to fit average-rated wines better, which could be due to the high number of average-rated wines in the data set. The low R-squared may also indicate that key properties that would better predict wine quality are missing.


Reflection

In this project, I was able to examine the relationships between physicochemical properties and identify the key variables that determine red wine quality, which are alcohol content and volatile acidity. Some interesting relationships between variables could be given a scientific explanation, such as the relationship between alcohol content, residual sugar and density in the winemaking process. Data wrangling skills were put into practice to rearrange data into a suitable format. The lack of variables that correlate strongly with wine quality and the large share of average-rated wines proved problematic for the analysis. It was hard to tell whether a true correlation was present. This also limits the accuracy of the prediction model. For future data exploration, it would be interesting to apply a different approach to building the model and to look into the evaluations made by each wine expert, as wine tasting is subject to individual preferences.